INTELLIGENCE TESTS AND THE CLASSIFICATION OF PUPILS. I



F. S. BREED and E. R. BRESLICH 
University of Chicago 



One far-reaching proposal that has been made in connection 
with intelligence tests is that they be used as the basis for classify- 
ing or grading pupils in the public schools. Reports are being 
received with considerable frequency from high schools where 
sections or classes of different levels of ability are being organized 
in various subjects by means of group tests of intelligence. Pro- 
gressive high-school principals and teachers are naturally making 
many inquiries about this movement. They are seeking informa- 
tion with regard to the general nature of the tests, the technique 
of administering them, and the methods of scoring. They are also 
inquiring which of the various tests designed for high-school use 
provide the most reliable measures of intellectual capacity, what 
degree of accuracy the best of these displays, whether the most 
accurate measures of intelligence constitute a satisfactory basis 
for the classification of pupils, to what extent emotional and 
volitional factors such as interest and industry determine a 
pupil's educational achievement, and whether, after all, educational 
achievement, rather than intelligence, is not the most satisfactory 
basis of classification. 

All of these questions were in the minds of the writers during 
the course of the present investigation. An attempt will be made 
to provide answers to some of them in this article. Part I deals 
primarily with the reliability of intelligence tests as the basis for 
determining the intelligence of high-school pupils. If intelligence is a 
satisfactory basis for classification, it becomes important to deter- 
mine as accurately as possible the error of classification that arises 
from imperfections in the measuring instruments. Part II will 
discuss the reliability of intelligence tests as the basis for predicting 
the educational achievement of pupils. This involves an examination
of the assumption that pupils should be classified on the basis
of their intelligence. 

The investigation upon which this article is based was conducted 
during the year 1920-21 in the University High School, University 
of Chicago. This report takes its place as one of a considerable 
number which have been published, dealing with the same general 
topic, and hopes to justify itself by its rather intensive study of 
the problem in a single group of sixty ninth-grade pupils. Parallel 
and corroborative studies were made in a group of fifty-four seventh- 
grade pupils in connection with numerous subsidiary problems. 

THE RELIABILITY OF SELECTED GROUP TESTS OF INTELLIGENCE 

The tests selected for use were the Chicago Group Intelligence 
Test, Form A; the Otis Group Intelligence Test, Advanced Exam- 
ination, Form A; and the Terman Group Test of Mental Ability, 
Form A. These tests were chosen because they are especially 
designed for the measurement of intelligence in secondary schools. 
Furthermore, they were regarded by the present writers as among 
the best for this purpose. These tests, as well as all others used in 
this study, were administered by Mr. Breslich, the intelligence tests 
being given near the beginning of the first semester. Unusual 
care was observed in administering the tests, for the results of such 
measurements are too often invalidated by failure to give the tests 
properly. All of the test papers were scored with similar care by
a trained assistant. 

Inter-test correlations. — One of the obvious methods of throwing 
light on the question of reliability is to determine the extent to 
which the tests agree in their measurements of the same individuals. 
Disagreement in such measurements might be due to (a) difference 
among the authors of the tests in their conceptions of intelligence, 
(b) failure of the testing instruments to measure accurately the 
thing intended by the authors, and (c) variability of pupils from 
test to test. It is assumed that the third factor will always make 
perfect agreement between tests of intelligence impossible when 
single measures only are considered. If disagreement is due to 
either the first or the second of the foregoing factors, it may be 
regarded as an indictment of the tests. They all purport to measure
general intelligence, and whether their failure to do so accurately
is attributable to wrong conceptions of the thing to be
measured or wrong construction of the thing to measure with 
matters little to the teachers and supervisors. Their verdict will 
be that tests which thus disagree must be improved through better 
understanding of the nature of intelligence and a more skilful 
construction of the instruments for measuring it. 

The extent to which these tests agree in their measures of the 
same intelligences will be indicated by the coefficient of correlation. 
In a very real sense this coefficient may be considered a coefficient 
of reliability, for repeated attempts to measure the same thing are 
being compared. The coefficients of correlation here and else- 
where in the report were computed by the product-moment method. 
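
The product-moment computation can be sketched as follows. This is a minimal illustration, not the authors' worksheet: the function names and the five pairs of scores are hypothetical, and the probable-error formula, 0.6745(1 - r^2)/sqrt(N), is the conventional one, assumed here because it reproduces the P.E. values reported with the coefficients (for example, .048 for r = .69 and N = 54).

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def probable_error(r, n):
    """Conventional probable error of r (an assumption of this sketch)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


# Hypothetical scores for five pupils on two tests (not data from the study).
test_a = [52, 61, 47, 70, 58]
test_b = [150, 171, 149, 190, 160]
r = pearson_r(test_a, test_b)

# Probable error for the lowest reported coefficient, r = .69 with N = 54.
pe_lowest = probable_error(0.69, 54)
```

Rounded to three places, `pe_lowest` agrees with the tabled value of .048.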

Table I shows the correlation between tests in both the seventh
and the ninth grades.

TABLE I

Correlation between Intelligence Tests in the Seventh and Ninth Grades

                     Seventh Grade           Ninth Grade
Tests             Coefficient   P.E.     Coefficient   P.E.
Chicago-Terman        .69       .048         .77       .034
Chicago-Otis          .77       .038         .78       .035
Otis-Terman           .74       .042         .85       .024

It will be observed that the correlation is
high, according to accepted standards for correlations. The
lowest of these coefficients is .69, the highest .85, and the average 
.77. It is evident that there is a great deal of agreement among the 
three tests. The Chicago and Terman tests show the least agree- 
ment, the Otis and Terman the greatest. If the analysis were not 
carried further, this relatively high correlation might be misinter- 
preted. The practical question here arises: What difference should 
be expected between classifications of pupils based on tests which 
agree to the extent shown in Table I? An answer to the question
is found in Table II, which shows the disparity between the results
from different tests in terms of the amount of pupil displacement.
The amount of such displacement was determined for the cases
of lowest, average, and highest correlation. To use as an illustration
the Chicago-Terman correlation found in the middle of the
table, where the coefficient happens to be identical with the average, 
the procedure of determining the number of displacements was as 
follows. The pupils of the ninth grade were ranked from lowest to 
highest according to their Chicago scores. The group was then 
divided into three equal sections, making the number of divisions 
the same as the actual number of sections in first-year mathematics 
in the University High School. Similarly, the pupils were ranked 
by their Terman scores, and comparison was made to determine 
the number in each Chicago tertile who were in a different 
tertile according to the Terman scores. The number of such 
differences or displaced individuals is shown in the table in the 
column headed "Number of Displacements." It will be noticed 
that the percentage of pupils displaced through disparity between 
tests is exactly 30 for the case in which the coefficient of correlation 
is equal to the average. This means that of the sixty ninth-grade 
pupils, classified into three sections by the Chicago test, eighteen 
were found to be in different sections from those in which they 
would have been if classified by the Terman test. If, as would be 
done in some high schools, the group of sixty pupils were here 
divided into two equal sections instead of three, the percentage 
of displacements would be reduced to 23. With the higher correla- 
tion of .85 in the case of the Terman-Otis comparison, the displace- 
ment percentage for the tertile grouping dropped to 21.7. For a 
division of the group into two sections of thirty each, the percentage 
of pupils displaced would be 20 instead of 21.7. 

In the right-hand column of Table II the displacement is 
represented in "units," by which is meant the number of sections 
or tertile-steps displaced. Since a pupil might be displaced two 
steps instead of one, it seemed expedient to present the results in 
terms of these units. Displacement by more than one unit was 
not found to be common. There were only six individuals whose 
classification by one test located them two sections away from 
their classification by another test. For the total number of 
forty-eight displacements in the three comparisons, there were 
found to be fifty-four units. 
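
The counting procedure described above (rank by each test, cut into equal sections, count pupils whose sections differ, and sum the section-steps) can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, ties are broken by list order, and the six scores are invented rather than taken from the study.

```python
def assign_sections(scores, k):
    """Rank pupils from lowest to highest and cut the ranking into k
    equal sections. Returns each pupil's section index (0 = lowest).
    Ties are broken by list order, an assumption of this sketch."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    sections = [0] * n
    for rank, pupil in enumerate(order):
        sections[pupil] = rank * k // n
    return sections


def displacement(scores_a, scores_b, k=3):
    """Return (pupils displaced, units of displacement, percentage displaced)
    when classification by test A is compared with classification by test B."""
    sec_a = assign_sections(scores_a, k)
    sec_b = assign_sections(scores_b, k)
    steps = [abs(a - b) for a, b in zip(sec_a, sec_b)]
    displaced = sum(1 for s in steps if s > 0)
    units = sum(steps)
    percent = 100 * displaced / len(steps)
    return displaced, units, percent


# Invented scores for six pupils: the first and last pupils trade extremes,
# so each is displaced by two sections.
test_a = [10, 20, 30, 40, 50, 60]
test_b = [65, 25, 35, 45, 55, 15]
displaced, units, percent = displacement(test_a, test_b, k=3)
# displaced == 2 pupils, units == 4 section-steps
```

Calling `displacement(test_a, test_b, k=2)` repeats the comparison for a division into two sections, the alternative grouping mentioned in the text.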

TABLE II

Amount of Pupil Displacement Accompanying Various Degrees of Correlation

Tertile         Number of Displacements    Units of Displacement

Grade VII. Chicago-Terman Tests (r = .69; N = 54)
1                         4                          6
2                         7                          7
3                         6                          7
Total                    17                         20
Percentage               31.5

Grade IX. Chicago-Terman Tests (r = .77; N = 60)
1                         4                          4
2                         8                          8
3                         6                          8
Total                    18                         20
Percentage               30

Grade IX. Terman-Otis Tests (r = .85; N = 60)
1                         4                          4
2                         6                          6
3                         3                          4
Total                    13                         14
Percentage               21.7

The results here discussed are exhibited graphically in Figures 1,
2, and 3, following the order in which the data are presented in the
table. Each square in these column diagrams represents a pupil.
The classification is made in each of the three figures on the basis of
the test first mentioned. An individual's score in this test appears
uppermost in the square. Immediately below is his score in the
compared test. Hatched squares indicate displaced pupils. The
heavy lines mark the separation of the sections.

It may be objected at this point that these results convey a 
misleading impression of the reliability of the intelligence tests,
since in each case the series of scores is compared not with a series
of true scores but with a series subject to error. In other words, in 
no case is the error of displacement due to only one of the tests. 
The objection is valid. In order to throw some light on this point, 
a series of composite intelligence scores was derived from the three 
intelligence tests. The method of derivation will be fully explained 
in Part II. Suffice it to say at this juncture that the composite 
series may be assumed to represent a nearer approach to the true 
values than any single series of scores, by the same argument that 
the average of several expert attempts at the measurement of an 
object probably approaches more closely the true measure than 
any one attempt. 

The correlation between Otis scores in the ninth grade and the 
composite scores was found to be .92. This coefficient was accom- 
panied by a pupil displacement of 13 per cent when the group was 
divided into three sections of twenty each, and 10 per cent when it 
was divided into two sections of thirty each. This indicates greater 
reliability for the Otis test than is apparent in Table II. The dis- 
placement for three sections is represented graphically in Figure 4. 
For the purpose of obtaining a more general notion of displace- 
ment as measured with reference to the composite intelligence 
scores, similar data were secured for the Chicago and Terman tests. 
It was found that the displacement amounted to 20 per cent for 
each of these tests. There was, therefore, an average displacement 
or error of classification amounting to about 18 per cent, due to 
the variability of pupils and the inaccuracy of the measuring 
instrument. 
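
As a quick arithmetic check, and assuming the reported average is the simple mean of the three-section displacement percentages for the Otis, Chicago, and Terman tests:

$$\frac{13 + 20 + 20}{3} \approx 17.7$$

which rounds to the figure of about 18 per cent.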

Mean difference between scores for the same pupil in different 
tests. — In order to probe farther into the question of test reliability, 
a study was made of the disparity in scores for the same individual. 
It is important for the teacher and supervisor to understand the 
degree of unreliability of the tests, not only in terms of a certain 
expectation of misplaced pupils, but also in terms of the expecta- 
tion of error in individual scores. 

This expectation of error has been computed in terms of the 
average difference between a pupil's scores in two tests. In order 
to compare scores of the same individuals on different scales, 



1922] INTELLIGENCE TESTS AND CLASSIFICATION 59

scores on one scale were transmuted into their equivalents on 
another scale. This was done by means of the formulas 

$$y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y$$

in which y represents deviations from the median of the Y scale, x
deviations from the median of the X scale, r the Pearson coefficient
of correlation, and σ the standard deviation of a distribution. The
value of r is unity in the case of perfect correlation. By assuming
the correlation between two scales to be perfect and substituting in
the foregoing general formulas the values for σ, the following
specific formulas¹ were obtained:

(1) Chicago-Otis transmutation:

$$y_c = .54\,x_o$$
$$x_o = 1.85\,y_c$$

(2) Chicago-Terman transmutation:

$$y_c = .42\,x_t$$
$$x_t = 2.38\,y_c$$

(3) Otis-Terman transmutation:

$$y_o = .78\,x_t$$
$$x_t = 1.28\,y_o$$

Thus, the first equation transmutes deviations on the Otis scale 
into Chicago units by multiplying each deviation by .54. 

The meaning of these transmutations is illustrated graphically 
in Figure 5. The Chicago scale is laid off vertically, and the 
Otis scale horizontally. A pupil's scores on the two scales are 
represented by the point of intersection of a vertical and a horizontal 
line. For example, if a pupil's scores are 59.5 on the Chicago test 
and 185 on the Otis test, the point A corresponding to these scores 
is found by passing from the 59.5 mark on the Chicago scale to the 
vertical line passing through the 185 mark on the Otis scale. 

¹ In these formulas the subscripts indicate the three scales.

The lines RR' and CC' represent the regression equations.
Line PP' is the line of perfect correlation and represents the
equations

$$y_c = .54\,x_o$$
$$x_o = 1.85\,y_c$$

In the case of perfect correlation all points would fall on the line
PP' instead of being scattered. The heavy lines My and Mx
indicate the means of the series. If for any particular individual
the distance from the mean of one of the distributions is given,
the formulas just given determine the distance he should be from
the mean of the other distribution. For example, pupil A is 33
points above the mean on the Otis scale. His transmuted score
is .54 × 33, or 18 points, above the mean on the Chicago scale, or 73.

Fig. 5. Scatter diagram to illustrate disparity between the Chicago and Otis
intelligence tests in the ninth grade. Mx and My represent the means of the Otis
and Chicago distributions, respectively. CC' and RR' are the lines of regression.
PP' is the line of perfect correlation. The broken lines parallel to the lines representing
the means indicate standard deviation. If the correlation between the tests were
perfect, the score of pupil A on the Otis scale, 185, would be accompanied by a score
of 73 on the Chicago scale. As a matter of fact, pupil A has a score of 59.5 on the
Chicago scale. The disparity between the transmuted Otis score and the actual
Chicago score is therefore 13.5.

In this manner actual Chicago scores were compared with trans- 
muted Otis scores of the same individuals, actual Otis scores were 
compared with transmuted Chicago scores, and so on for all com- 
binations of the three tests and for both the seventh and the ninth 
grades. In order that the results might be as accurate as possible, 
the equivalent of each deviation was obtained by a separate applica- 
tion of the proper formula. 

After the process of transmutation had been completed for 
any two tests, the difference was found between the two scores 
for each pupil. This difference is the significant value sought. 
Perfect measurement of a constant quality or trait would show no 
difference between two scores of this kind. Such a difference 
is a symptom of inaccuracy of measurement or of variability of 
the thing measured, or of both. It represents for the teacher the 
amount by which he may expect two of these tests to differ in the 
measurement of the same pupil. This difference for pupil A is 
the difference between his transmuted score, 73, and his actual 
score, 59.5, and is equal to 13.5. 
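
The pupil A example can be retraced with a short Python sketch. It assumes perfect correlation, as the text does, and the function name `transmute` is hypothetical. The two means are not stated directly in the article; they are inferred from the example itself (185 is 33 points above the Otis mean, and 73 is 18 points above the Chicago mean).

```python
def transmute(score, mean_from, mean_to, ratio):
    """Convert a score to another scale, assuming perfect correlation (r = 1).

    ratio is the standard deviation of the target scale divided by that of
    the source scale, e.g. .54 for Otis deviations expressed in Chicago units.
    """
    return mean_to + ratio * (score - mean_from)


# Means inferred from the pupil A example, not quoted from the article.
OTIS_MEAN = 185 - 33      # 152
CHICAGO_MEAN = 73 - 18    # 55

# Pupil A: Otis score 185, actual Chicago score 59.5.
otis_as_chicago = transmute(185, OTIS_MEAN, CHICAGO_MEAN, 0.54)
disparity = abs(otis_as_chicago - 59.5)
```

Without intermediate rounding the transmuted score is 72.82 and the disparity 13.32; the article rounds the transmuted score to 73 before subtracting, which gives the reported 13.5.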

A summary of the results of the transmutations is presented in
Table III.

TABLE III

Average Difference between Scores for an Individual in Two Different Tests

Tests                 Seventh Grade   Ninth Grade   Average
Otis on Chicago            5.3            6.4          6.0
Terman on Chicago          5.9            6.5
Chicago on Otis           11.0           11.8         11.1
Terman on Otis            11.7           10.0
Chicago on Terman         13.8           15.5         13.9
Otis on Terman            13.4           12.8

The table should be read as follows: When the Otis
scores for the seventh grade were transmuted into Chicago equivalents
and the differences found between these Chicago equivalents
and the actual Chicago scores for the same pupils, these differences
for the group averaged 5.3 points on the Chicago scale. An average
of the four average differences for each scale is presented. In the 
measurement of the intelligence of an individual pupil in these 
two grades, therefore, it is observed that the tests differed on the 
average 6 points when the difference was measured on the Chicago
scale, 11.1 points when measured on the Otis scale, and 13.9
points when measured on the Terman scale. 
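
The three scale averages can be checked from the tabled values. The sketch below assumes each average is the simple mean of the four entries expressed on that scale (two comparisons in each of two grades); the dictionary layout and variable names are illustrative only.

```python
from statistics import mean

# Average differences from Table III as (seventh grade, ninth grade),
# grouped by the scale on which the difference is measured.
table_iii = {
    "Chicago": {"Otis on Chicago": (5.3, 6.4), "Terman on Chicago": (5.9, 6.5)},
    "Otis": {"Chicago on Otis": (11.0, 11.8), "Terman on Otis": (11.7, 10.0)},
    "Terman": {"Chicago on Terman": (13.8, 15.5), "Otis on Terman": (13.4, 12.8)},
}

# Mean of the four values on each scale, rounded to one decimal place.
scale_averages = {
    scale: round(mean(value for pair in rows.values() for value in pair), 1)
    for scale, rows in table_iii.items()
}
```

The resulting values, 6.0, 11.1, and 13.9, agree with the averages quoted in the text.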

It should be borne in mind that these figures represent average 
expectation of disparity in regard to individual scores. In addi- 
tion to the average expectation, teachers should know also some- 
thing of the range of these differences. Especially should they 
know how large in certain cases the difference might be. 

In Table IV are shown the minimum and maximum differences
throughout.

TABLE IV

Difference between Two Scores for the Same Pupil in Two Different Tests

                         Seventh Grade          Ninth Grade
Scales                 Minimum   Maximum     Minimum   Maximum
Otis on Chicago           .11     17.00         .06     18.64
Terman on Chicago         .25     22.38         .31     20.49
Chicago on Otis           .22     35.05         .11     34.51
Terman on Otis            .65     43.23         .03     29.63
Chicago on Terman         .59     52.61         .72     48.65
Otis on Terman            .68     49.32         .04     37.96

The minimum differences naturally tend to approximate
zero and are not especially important in this discussion. The
maximum differences were: 22.38 points in terms of the Chicago
scale, 43.23 points in terms of the Otis scale, and 52.61 points in 
terms of the Terman scale. These differences are important so 
far as the use of these intelligence tests is concerned, and decidedly 
call for caution in the interpretation of measurements of individual 
pupils. 

A further analysis of the results shows that the twelve
maximum-difference values given in Table IV represent only five different
pupils. Of the fifteen scores of these individuals in the three
tests, all were below the median in the Chicago test, three were
below the median in the Otis test, and one was below the median 
in the Terman test. Four of these five pupils were below the 
median in two of the three tests. In other words, the most variable 
scores were obtained for weak rather than strong pupils. This 
conclusion was confirmed by an examination of the scores received 
by the six pupils whose measures show the minimum differences. 
Of these pupils, only two were below the median in two tests, and 
these were only slightly below. 

In order to determine the frequency of such large differences 
between scores as have been referred to, a tabulation was made of 
all differences over 20 between two typical series, the actual Otis 
and the transmuted Chicago scores. There were found to be ten 
such differences in the ninth grade and nine in the seventh. In each 
group, therefore, one-sixth of the total number of pupils had scores 
in the Chicago and Otis tests differing by more than 20 points as 
measured on the Otis scale. 

Data on the relative reliability of the tests. — When teachers or 
supervisors are confronted with the problem of classification and 
are contemplating the use of general intelligence tests, they want 
to know not only something about the accuracy of these tests in 
general, but also something about their relative accuracy. Persons 
who are making a study of such tests are repeatedly asked which 
is best for a given purpose. Data tending to provide an answer 
to this query, so far as the present tests are concerned, are found 
in Table IV and in the following discussion. 

By reference to Table III it will be seen that the Otis transmuted 
scores and the Chicago actual scores in the seventh grade differ 
on the average by 5.3 points. It should be observed further that 
the Terman transmuted and the Chicago actual scores differ on 
the average by 5.9. While the difference between 5.3 and 5.9 
appears small, it is nevertheless significant. It corresponds, for 
example, to the difference that exists between the correlation 
coefficients .77 and .69, the smaller difference accompanying the 
higher correlation. A similar result is found in the ninth grade. 
These figures indicate that the Otis and Chicago tests are in closer 
agreement than the Terman and Chicago tests, a finding consistent 



64 THE SCHOOL REVIEW [January 

with the correlation data in Table I. According to the data presen- 
ted in Table III in the comparison of Chicago and Otis transmuted 
with Terman actual scores, it is observed that the Otis test is in 
closer agreement with the Terman test than is the Chicago test 
with the Terman. The smaller average differences in both grades 
indicate this. The Otis test may therefore be said to occupy an 
intermediate position between the other two tests. In other words, 
results secured from the Otis test are found to be closer to the 
results of each of the other tests than these are to each other. 
And while average position where only three cases are involved 
may not mean much, this fact regarding the position of the Otis 
test is offered for consideration in connection with the facts which 
immediately follow. 

Variation of the deviations of individual scores from deviations 
assumed to be true. — It is the purpose of this section of the report 
to present indices that may throw more light on the problem of 
relative reliability of the tests. These indices, appearing in Table 
V, are expressed in comparable terms, each representing for any
one test the average amount that an individual's deviation from
the mean of a group varies from the average of his deviations
in the three tests. The method will become clearer from the following
statement of steps governing the computation: (1) compute
the mean for each series of scores, (2) find the deviation from the
mean for each individual score, (3) reduce deviations to a comparable
basis by dividing each by the standard deviation of its
distribution, (4) find the mean of the three comparable deviations
for each individual score, (5) assume this to be the true deviation
of the individual score, (6) subtract each deviation from the
deviation assumed to be true, (7) find the mean of these differences
for each test.

TABLE V

Indices of Variation, for the Three Tests, from Values Assumed to Be True

Test         Seventh Grade   Ninth Grade
Chicago       [illegible]       .32σ
Otis             .31σ           .27σ
Terman           .37σ           .2[?]σ
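
The seven computation steps listed above can be sketched in Python. Two details are assumptions of this sketch rather than statements of the article: the standard deviation is taken over the whole group (`pstdev`), and the differences in steps (6) and (7) are averaged without regard to sign. The function name and the four pupils' scores are hypothetical.

```python
from statistics import mean, pstdev


def variation_indices(scores_by_test):
    """scores_by_test maps a test name to a list of scores, with pupils
    in the same order in every list. Returns one index per test, in
    standard-deviation (sigma) units."""
    # Steps 1-3: deviation of each score from its test mean, divided by sigma.
    comparable = {}
    for test, scores in scores_by_test.items():
        m = mean(scores)
        sd = pstdev(scores)
        comparable[test] = [(s - m) / sd for s in scores]

    n_pupils = len(next(iter(comparable.values())))

    # Steps 4-5: mean comparable deviation per pupil, assumed to be true.
    true_dev = [mean(comparable[t][i] for t in comparable)
                for i in range(n_pupils)]

    # Steps 6-7: average absolute difference from the assumed true deviation.
    return {
        t: mean(abs(comparable[t][i] - true_dev[i]) for i in range(n_pupils))
        for t in comparable
    }


# Invented scores for four pupils on three tests (not data from the study).
example = {
    "Chicago": [50, 55, 60, 65],
    "Otis": [140, 150, 165, 170],
    "Terman": [90, 110, 100, 130],
}
indices = variation_indices(example)
```

A test whose comparable deviations lie closest to the three-test average receives the smallest index, which is the sense in which the indices are compared in the discussion that follows.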

The Otis test was found to have the smallest index of variation in
both grades, .31σ in the seventh and .27σ in the ninth. In estimating
the amount of difference between the various indices, the fact
should be considered that each sigma is a unit of considerable 
magnitude, representing between one-fourth and one-fifth of the 
total range of scores, according to the test; also that the probable 
error of averages such as these is very small. The differences 
between indices are therefore regarded as significant. These 
results are in agreement with the previous findings in regard to the 
Otis test and tend to favor it as a testing instrument. 

SUMMARY AND CONCLUSIONS 

1. This investigation embraces two problems: (a) the reliability
of intelligence tests as the basis for determining the intelligence 
of pupils, and hence for classifying them according to intelligence; 
(b) the reliability of intelligence tests as the basis for predicting 
the educational achievement of pupils, and hence classifying them 
for school work. The first of these problems is here discussed; 
the second will be discussed in Part II. 

2. Three intelligence tests were administered to the same groups 
of pupils: (a) the Chicago Group Intelligence Test, Form A; 
(b) the Otis Group Intelligence Test, Advanced Examination, Form 
A; (c) the Terman Group Test of Mental Ability, Form A. 

3. The average inter-test correlation was .77. According to 
ordinary standards, this coefficient would be regarded as high, and 
without further analysis might become the basis for the conclusion 
that classification of pupils based upon one of these tests would 
differ in no essential way from classification based upon another. 

4. The average inter-test correlation was accompanied by a 
pupil displacement of 30 per cent. In other words, 30 per cent of 
the pupils classified by one test were found to be out of place accord- 
ing to another. 

5. The average displacement for the three tests was 18 per cent, 
as measured by comparison with a series of composite intelligence 
scores, on an enrolment basis of twenty pupils per section. This 
means that between one-fifth and one-sixth of the pupils were not
properly classified according to intelligence by the test, as judged
by the criterion of composite scores. It will be observed that this 
percentage is somewhat decreased by increasing the enrolment of 
sections. The standard for maximum size of class recommended 
by the North Central Association is twenty-five. 

6. The average disparity between individual scores for the same 
pupils in two different tests was found to be 6 points when measured 
on the Chicago scale, 11.1 points when measured on the Otis scale,
and 13.9 points when measured on the Terman scale. An exam- 
ination of Otis-Chicago data revealed the fact that one-sixth of the 
total number of pupils tested received scores differing from each 
other by 20 or more Otis points. This degree of variability in the 
results of measurement calls for great caution in the use of these 
tests for the purpose of classifying pupils according to intelligence. 
It would seem that no serious attempt at such classification should 
be made in any high school without the use of at least two good 
group tests, supplemented by additional testing where marked 
disagreement between tests is found. 

7. (a) The Otis test agreed more closely with each of the other 
two tests than these did with each other, as shown by data on the 
relative disparity between scores of the same pupils in two different 
tests, (b) The Otis test exhibited less variability in the deviations 
of individual scores from a deviation-value assumed to be true. 
These results show that the Otis test occupies an intermediate 
position between the other two tests; that is, it approximates 
more closely the average results from the three tests than does either
of the other tests. On the principle that the average of several 
expert attempts to measure an object is probably a more reliable 
measure of the object than any single measure, this situation seems 
to be decidedly in favor of the Otis test. The use, however, of 
these and other group intelligence tests in a more extensive com- 
parative study may establish a different position for the Otis test 
with reference to the series of true measures. Until a different 
position is experimentally established, the Otis test should be 
regarded as the most reliable of the three instruments for measur- 
ing the general intelligence of high-school Freshmen. 

[To be concluded] 



